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Abstract 

Stochastic algorithms are efficient approaches to solving machine learning and optimization 
problems. In this paper, we propose a general framework called Splash for parallelizing stochastic 
algorithms on multi-node distributed systems. Splash consists of a programming interface and 
an execution engine. Using the programming interface, the user develops sequential stochastic 
algorithms without concerning any detail about distributed computing. The algorithm is then 
automatically parallelized by a communication-efficient execution engine. We provide theoreti¬ 
cal justifications on the optimal rate of convergence for parallelizing stochastic gradient descent. 
Splash is built on top of Apache Spark. The real-data experiments on logistic regression, collab¬ 
orative filtering and topic modeling verify that Splash yields order-of-magnitude speedup over 
single-thread stochastic algorithms and over state-of-the-art implementations on Spark. 


1 Introduction 

Stochastic optimization algorithms process a large-scale dataset by sequentially processing random 
subsamples. This processing scheme makes the per-iteration cost of the algorithm much cheaper 
than that of batch processing algorithms while still yielding effective descent. Indeed, for convex 
optimization, the efficiency of stochastic gradient descent (SGD) and its variants has been estab¬ 
lished both in theory and in practice [41, 5, 38, 8, 35, 17]. For non-convex optimization, stochastic 
methods achieve state-of-the-art performance on a broad class of problems, including matrix fac¬ 
torization [19], neural networks [20] and representation learning [37]. Stochastic algorithms are 
also widely used in the Bayesian setting for finding approximations to posterior distributions; ex¬ 
amples include Markov chain Monte Carlo, expectation propagation [28] and stochastic variational 
inference [15]. 

Although classical stochastic approximation procedures are sequential, it is clear that they also 
present opportunities for parallel and distributed implementations that may yield significant ad¬ 
ditional speedups. One active line of research studies asynchronous parallel updating schemes in 
the setting of a lock-free shared memory [34, 9, 24, 44, 14]. When the time delay of concurrent 
updates are bounded, it is known that such updates preserve statistical correctness [1, 24]. Such 
asynchronous algorithms yield significant speedups on multi-core machines. On distributed systems 
connected by commodity networks, however, the communication requirements of such algorithms 
can be overly expensive. If messages are frequently exchanged across the network, the communi¬ 
cation cost will easily dominate the computation cost. 
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There has also been a flurry of research studying the implementation of stochastic algorithms 
in the fully distributed setting [45, 42, 30, 12, 23]. Although promising results have been reported, 
the implementations proposed to date have their limitations—they have been designed for specific 
algorithms, or they require careful partitioning of the data to avoid inconsistency. 

In this paper, we propose a general framework for parallelizing stochastic algorithms on multi¬ 
node distributed systems. Our framework is called Splash (System for Parallelizing Learning 
Algorithms with Stochastic Methods). Splash consists of a programming interface and an execution 
engine. Using the programming interface, the user develops sequential stochastic algorithms without 
thinking about issues of distributed computing. The algorithm is then automatically parallelized 
by the execution engine. The parallelization is communication efficient, meaning that its separate 
threads don’t communicate with each other until all of them have processed a large bulk of data. 
Thus, the inter-node communication need not be a performance bottleneck. 

Programming Interface The programming interface is designed around a key paradigm: im¬ 
plementing incremental updates that processes weighted data. Unlike existing distributed machine 
learning systems [6, 39, 21, 29] which requires the user to explicitly specify a distributed algorithm. 
Splash asks the user to implement a processing function that takes an individual data element as 
input to incrementally update the corresponding variables. When this function is iteratively called 
on a sequence of samples, it defines a sequential stochastic algorithm. It can also be called in a 
distributed manner for constructing parallel algorithms, which is the job of the execution engine. 
This programming paradigm allows one algorithmic module working on different computing en¬ 
vironments, no matter if it is a single-core processor or a large-scale cluster. As a consequence, 
the challenge of parallelizing these algorithms has been transferred from the developer side to the 
system side. 

To ensure parallelizability, the user is asked to implement a slightly stronger version of the 
base sequential algorithm: it needs to be capable of processing weighted samples. An m-weighted 
sample tells the processing function that the sample appears m times consecutively in the sequence. 
Many stochastic algorithms can be generalized to processing weighted samples without sacrificing 
computational efficiency. We will demonstrate SGD and collapsed Gibbs sampling as two concrete 
examples. Since the processing of weighted samples can be carried out within a sequential paradigm, 
this requirement does not force the user to think about a distributed implementation. 

Execution Engine In order to parallelize the algorithm. Splash converts a distributed process¬ 
ing task into a sequential processing task using distributed versions of averaging and reweighting. 
During the execution of the algorithm, we let every thread sequentially process its local data. The 
local updates are iteratively averaged to construct the global update. Gritically, however, although 
averaging reduces the variance of the local updates, it doesn’t reduce their bias. In contrast to the 
sequential case in which a thread processes a full sequence of random samples, in the distributed 
setting every individual thread touches only a small subset of samples, resulting in a significant 
bias relative to the full update. Our reweighting scheme addresses this problem by feeding the 
algorithm with weighted samples, ensuring that the total weight processed by each thread is equal 
to the number of samples in the full sequence. This helps individual threads to generate nearly- 
unbiased estimates of the full update. Using this approach, Splash automatically detects the best 
degree of parallelism for the algorithm. 
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Theoretically, we prove that Splash achieves the optimal rate of convergence for parallelizing 
SGD, assuming that the objective function is smooth and strongly convex. We conduct exten¬ 
sive experiments on a variety of stochastic algorithms, inclnding algorithms for logistic regression, 
collaborative filtering and topic modeling. The experiments verify that Splash can yield orders- 
of-magnitnde speedups over single-thread stochastic algorithms and over state-of-the-art batch 
algorithms. 

Besides its performance. Splash is a contribution on the distribnted computing systems front, 
providing a flexible interface for the implementation of stochastic algorithms. We build Splash on 
top of Apache Spark [40], a popular distributed data-processing framework for batch algorithms. 
Splash takes the standard Resilient Distributed Dataset (RDD) of Spark as input and generates an 
RDD as output. The data structure also supports default RDD operators such as Map and Reduce, 
ensuring convenient interaction with Spark. Because of this integration. Splash works seamlessly 
with other data analytics tools in the Spark ecosystem, enabling a single system to address the 
entire analytics pipeline. 

2 Shared and Local Variables 

In this paper, we focus on the stochastic algorithms which take the following general form. At 
step t, the algorithm receives a data element zt and a vector of shared variables vt- Based on these 
values the algorithm performs an incremental update A(zt,vt) on the shared variable: 

vt+i ^ vt +A{zt,vt) (1) 

For example, stochastic gradient descent (SGD) fits this general framework. Letting x denote a 
random data element x and letting w denote a parameter vector, SGD performs the update: 

t t + 1 and w w — rjtVi{w\x) (2) 

where i{-',x) is the loss function associated with the element and ijt is the stepsize at time t. In 
this case both w and t are shared variables. 

There are several stochastic algorithms using local variables in their compntation. Every local 
variable is associated with a specific data element. For example, the collapsed Gibbs sampling 
algorithm for LDA [13] maintains a topic assignment for each word in the corpus. Suppose that a 
topic k € {1,K} has been sampled for a word w, which is in document d. The collapsed Gibbs 
sampling algorithm updates the word-topic counter and the document-topic counter ridk by 

'^wk ^ ^wk T f and i '^dk T f • (^) 

The algorithm iteratively resample topics for every word until the model parameters converge. 

When a new topic is sampled for the word w, the following operation removes the old topic before 
drawing the new one: 

’^wk ^ '^wk I and rif^k ^ ‘^dk 1* (4) 

Update (3) and update (4) are executed at different stages of the algorithm but they share the 
same topic k. Thus, there should be a local variable associated with the word w storing the topic. 
Splash supports creating and updating local variables during the algorithm execution. 

The usage of local variables can sometimes be tricky. Since the system carries ont antomatic 
reweighting and rescaling (refer to Section 4.2), any improper usage of the local variable may 
cause inconsistent scaling issues. The system thus provides a more robust interface called “delayed 
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operator” which substitutes the functionality of local variables in many situations. In particular, 
the user can declare an operation such as (4) as a delayed operation and suspend its execution to 
the next time when the same element is processed. The scaling consistency of the delay operation 
is guaranteed by the system. 

Shared variables and local variables are stored separately. In particular, shared variables are 
replicated on every data partition. Their values are synchronized. The local variables, in contrast, 
are stored with the associated data elements and will never be synchronized. This storage scheme 
optimizes the communication efficiency and allows for convenient element-wise operations. 

3 Programming with Splash 

Splash allows the user to write self-contained Scala applications using its programming interface. 
The goal of the programming interface is to make distributed computing transparent to the user. 
Splash extends Apache Spark to provide an abstraction called a Parametrized RDD for storing and 
maintaining the distributed dataset. The Parametrized RDD is based on the Resilient Distributed 
Dataset (RDD) [40] used by Apache Spark. It can be created from a standard RDD object: 

val paramRdd = new ParainetrizedRDD(rdd). 

We provide a rich collection of interfaces to convert the components of Parametrized RDD to 
standard RDDs, facilitating the interaction between Splash and Spark. To run algorithms on the 
Parametrized RDD, the user creates a function called process which implements the stochastic 
algorithm, then calls the method 

paramRdd.run(process) 

to start running the algorithm. In the default setting, the execution engine takes a full pass over 
the dataset by calling run() once. This is called one iteration of the algorithm execution. The 
inter-node communication occurs only at the end of the iteration. The user may call run() multiple 
times to take multiple passes over the dataset. 

The process function is implemented using the following format; 

def process(elem : Any, weight : Int, sharedVar ; VarSet, localVar : VarSet){... } 

It takes four arguments as input: a single element elem, the weight of the element, the shared 
variable sharedVar and the local variable localVar associated with the element. The goal is to 
update sharedVar and localVar according to the input. 

Splash provides multiple ways to manipulate these variables. Both local and shared variables are 
manipulated as key-value pairs. The key must be a string; the value can be either a real number or 
an array of real numbers. Inside the process implementation, the value of local or shared variables 
can be accessed by localVar.get (key) or sharedVar.get (key). The local variable can be updated 
by setting a new value for it; 

localVar.set(key, value) 

The shared variable is updated by operators. For example, using the add operator, the expression 

sharedVar.add(key, delta) 

adds a scalar delta to the variable. The SGD updates (2) can be implemented via several add 
operators. Other operators supported by the programming interface, including delayed add and 
multiply, are introduced in Section 4.2. Similar to the standard RDD, the user can perform map and 
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reduce operations directly on the Parametrized RDD. For example, after the algorithm terminates, 
the expression 

val loss = paramRdd.map(evalLoss).sum() 
evaluates the element-wise losses and aggregates them across the dataset. 


4 Strategy for Parallelization 

In this section, we first discuss two naive strategies for parallelizing a stochastic algorithm and their 
respective limitations. These limitations motivate the strategy that Splash employs. 


4.1 Two naive strategies 


We denote by A (5) the incremental update on variable v after processing the set of samples S. 
Suppose that there are m threads and each thread processes a subset Si of S. If the i-th thread 
increments the shared variable by A(5j), then the accumulation scheme constructs a global update 
by accumulating local updates: 

m 

Wew ^old T A(S'j). (5) 

i=l 


The scheme (5) provides a good approximation to the full update if the batch size \Di\ is suffi¬ 
ciently small [1]. However, frequent communication is necessary to ensure a small batch size. For 
distributed systems connected by commodity networks, frequent communication is prohibitively 
expensive, even if the communication is asynchronous. 

Applying scheme (5) on a large batch may easily lead to divergence. Taking SGD as an example: 
if all threads starts from the same vector ruoid; then after processing a large batch, the new vector 
on each thread will be close to the optimal solution w*. If the variable is updated by formula (5), 
then we have 

m m 

Wnew -W* = Wo\d “ + X] ~ - W* + '^{w* - Wold) = {m - I)(u;* - Wold)- 

i=l i=l 

Clearly SGD will diverge if m > 3. 

One way to avoid divergence is to multiply the incremental change by a small coefficient. When 
the coefficient is 1/m, the variable is updated by 


^new 


^ HL 

Void +-y^HSi)- 
m 


( 6 ) 


i=\ 

This averaging scheme usually avoids divergence. However, since the local updates are computed on 
\/jri^h q£ they make little progress comparing to the full sequential update. Thus the algorithm 
converges substantially slower than its sequential counterpart after processing the same amount of 
data. See Section 4.4 for an empirical evidence of this claim. 


4.2 Our strategy 

We now turn to describe the Splash strategy for combining parallel updates. First we introduce 
the operators that Splash supports for manipulating shared variables. Then we illustrate how 
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conflicting updates are combined by the reweighting scheme. 


Operators The programming interface allows the user to manipulate shared variables inside their 
algorithm implementation via operators. An operator is a function that maps a real number to 
another real number. Splash supports three types of operators: add, delayed add and multiply. The 
system employs different strategies for parallelizing different types of operators. 

The add operator is the the most commonly used operator. When the operation is performed 
on variable v, the variable is updated by u •(— u + <5 where 5 is a user-specified scalar. The SGD 
update ( 2 ) can be implemented using this operator. 

The delayed add operator performs the same mapping v v + 6] however, the operation will 
not be executed until the next time that the same element is processed by the system. Delayed 
operations are useful in implementing sampling-based stochastic algorithms. In particular, before 
the new value is sampled, the old value should be removed. This “reverse” operation can be 
declared as a delayed operator when the old value was sampled, and executed before the new value 
is sampled. See Section 4.3 for a concrete example. 

The multiply operator scales the variable by u ■(— 7 • u where 7 is a user-specified scalar. The 
multiply operator is especially efficient for scaling high-dimensional arrays. The array multiplication 
costs 0(1) computation time, independent of the dimension of the array. The fast performance 
is achieved by a “lazy update” scheme. For every array u, there is a variable V maintaining the 
product of all multipliers applied to the array. The multiply operator updates G •(— 7 • F with 0(1) 
time. For the i-th element Ui, a variable F maintains the product of all multipliers applied to the 
element. When the element is accessed, the system updates Ui and F by 

F 

Ui^ — - Ui and F ^ F. 

In other words, we delay the multiplication on individual element until it is used by the program. 
As a consequence, those infrequently used elements won’t be a bottleneck on the algorithm’s per¬ 
formance. See Section 4.3 for a concrete example. 

Reweighting Assume that there are m thread running in parallel. Note that all Splash operators 
are linear transformations. When these operators are applied sequentially, they merge into a 
single linear transformation. Let Si be the sequence of samples processed by thread i, which is 
a fraction 1/m of the full sequence S. For an arbitrary shared variable v, we can write thread z’s 
transformation of this variable in the following form; 

v^T(Si)-v + A(S,) + T(Si), (7) 

Here, both F(S'j), A(5j) and T(Si) are thread-level operators constructed by the execution engine: 
r(S'j) is the aggregated multiply operator, A(Si) is the term resulting from the add operators, and 
T(Si) is the term resulting from the delayed add operators executed in the current iteration. A 
detailed construction of r(S'j), A(Si) and T(Si) is given in Appendix A. 

As discussed in Section 4.1, directly combining these transformations leads to divergence or 
slow convergence (or both). The reweighting scheme addresses this dilemma by assigning weights 
to the samples. Since the update (7) is constructed on a fraction 1/m of the full sequence S, we 
reweight every element by m in the local sequence. After reweighting, the data distribution of Si 
will approximate the data distribution of S. If the update (7) is a (randomized) function of the 
data distribution of Si, then it will approximate the full sequential update after the reweighting, 
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thus generating a nearly unbiased update. 

More concretely, the algorithm manipulates the variable by taking sample weights into account. 
An m-weighted sample tells the algorithm that it appears m times consecutively in the sequence. 
We rename the transformations in (7) by r(m5j), A{mSi) and T{mSi), emphasizing that they are 
constructed by processing m-weighted samples. Then we redefine the transformation of thread i 

by 

V ■(— T{mSi) ■ V + A{mSi) + T{mSi) (8) 

and dehne the global update by 

^ m m 

'^new — ~ {T{mGi) ■ Uoid + A(mS'j)^ + ^ T{mSi). (9) 

i=l i=l 

Equation (9) combines the transformations of all threads. The terms T[mSi) and A(mS'j) are 
scaled by a factor 1/m because they were constructed on m times the amount of data. The term 
T{mSi) is not scaled, because the delayed operators were declared in earlier iterations, independent 
of the reweighting. Finally, the scaling factor 1/m should be multiplied to all delayed operators 
declared in the current iteration, because these delayed operators were also constructed on m times 
the amount of data. 

Determining the degree of parallelism To determine the thread number m, the execution 
engine partitions the available cores into different-sized groups. Suppose that group i contains m* 
cores. These cores will execute the algorithm tentatively on m* parallel threads. The best thread 
number is then determined by cross-validation and is dynamically updated. The cross-validation 
requires the user to implement a loss function, which takes the variable set and an individual data 
element as input to return the loss value. See Appendix B for a detailed description. To find the 
best degree of parallelism, the base algorithm needs to be robust in terms of processing a wide 
range of sample weights. 

4.3 Generalizing stochastic algorithms 

Many stochastic algorithms can be generalized to processing weighted samples without sacrificing 
computational efficiency. The most straightforward generalization is to repeat the single-element 
update m times. For example, one can generalize the SGD updates (2) by 

t t + m and ww — rjt^rn^ (10) 

where := Vi is the sum of all stepsizes in the time interval [t — m + l,t], and r/j is 

the stepsize for the unit-weight sequential SGD. If m is large, computing rjt^m might be expensive. 
We may approximate it by 

rt+l 

'nt,m ~ / Vzdz 

J t—m+1 

if the right-hand side has a closed-form solution, or simply approximate it by r]t^m ~ mrif 

In many applications, the loss function f (re; x) can be decomposed as f (re; x) := f{w,x) + ^\\w\\\ 
where the second term is the ^ 2 -norm regularization. Thus, we have Vf(u);x) = V f{w,x) -|- Xw. If 
the feature vector x is sparse, then V f{w, x) is usually sparse as well. In this case, we have a more 
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efficient implementation of (10): 

t •(— t + m (via add operator), 

re •(— (1 — ■ w (via multiply operator), 

w w — f{w; x) (via add operator). 

Note that the multiply operator has complexity 0(1). Thus, the overall complexity is proportional 
to the number of non-zero components of V/(rc;x). If Vf{w,x) is a sparse vector, then this 
update will be more efficient than (10). It demonstrates the benefit of combining different types of 
operators. 

Note that equation (10) scales the stepsize with respect to m, which might be unsafe m is 
very large. Karampatziakis and Langford [18] propose a robust approach to dealing with large 
importance weights in SGD. The programming interface allows the user to implement the approach 
by Karampatziakis and Langford [18]. 

We take the collpased Gibbs sampling algorithm for LDA as a second example. The algorithm 
iteratively draw a word w from document d, and sample the topic of w by 

P(topic = k\d, w) (X -——- . (11) 

Uk + pW 

Here, W is the size of the vocabulary; ridk is the number of words in document d that has been 
assigned topic k] n^k is the total number of times that word w is assigned to topic k and Uk ■= 
'^■uj nwk- The constants a and /3 are hyper-parameters of the LDA model. When a topic k is 
sampled for the word, the algorithm updates n^k and Udk by (3). When a new topic will be 
sampled for the same word, the algorithm removes the old topic k by (4). If the current word has 
weight m, then we can implement the algorithm by 

^wk ^ n^k + nr and Udk ^ ndk + rn (via add operator), (12) 

~ nr and ndkUdk — (via delayed add operator). (13) 

As a consequence, the update (12) will be executed instantly. The update (13) will be executed at 
the next time when the same word is processed. 

4.4 A toy example 

We present a toy example illustrating the strategy described in Section 4.2. Gonsider the fol¬ 
lowing convex optimization problem. There are N = 3, 000 two-dimensional vectors represented 
by xi,... ,xn, such that Xi is randomly and independently drawn from the normal distribution 
X ~ A^(0, 12 x 2 )- The goal is to find a two-dimensional vector w which minimizes the weighted 
distance to all samples. More precisely, the loss function on sample Xi is defined by 

i{w; Xi) := (xi - w)'^ ( ^ ) (xj - ru) 

and the overall objective function is L(w) := J2^i-^(uj;xi). We want to find the vector that 

minimizes the objective function L{w). 

We use the SGD update (10) to solve the problem. The algorithm is initialized by wq = 
(-1,-1)^ and the stepsize is chosen by r]t = llVt- For parallel execution, the dataset is evenly 
partitioned into m = 30 disjoint subsets, such that each thread accesses to a single subset, containing 
1/30 faction of data. The sequential implementation and the parallel implementations are compared 
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Figure 1: Comparing parallelization schemes on a simple convex optimization problem. Totally 
N = 3 ,000 samples are partitioned into m = 30 batches. Each batch is processed by an independent 
thread running stochastic gradient descent. Each thread uses either unit-weight data or weighted 
data (weight = 30). The local solutions are combined by either averaging or accumulation. From 
the plot, we hnd that combining weighted solutions achieves the best performance. 


in Figure 1. Specifically, we compare seven types of implementations defined by different strategies: 

(a) The exact minimizer of L[w). 

(b) The solution of SGD achieved by taking a full pass over the dataset. The dataset contains 
N = 3 ,000 samples. 

(c) The local solutions by 30 parallel threads. Each thread runs SGD by taking one pass over its 
local data. The local dataset contains 100 samples. 

(d) Averaging local solutions in (c). This is the averaging scheme described by formula (6). 

(e) Aggregating local solutions in (c). This is the accumulation scheme described by formula (5). 

(f) The local solution by 30 parallel threads processing weighted data. Each element is weighted 
by 30. Each thread runs SGD by taking one pass over its local data. 

(g) Gombining parallel updates by formula (9), setting sample weight m = 30. Under this setting, 
formula (9) is equivalent to averaging local solutions in (f). 

In Figure 1, we observe that solution (b) and solution (g) achieve the best performance. Solution 
(b) is obtained by a sequential implementation of SGD: it is the baseline solution that parallel 
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algorithms target at approaching. Solution (g) is obtained by Splash with the reweighting scheme. 
The solutions obtained by other parallelization schemes, namely solution (d) and (e), have poor 
performances. In particular, the averaging scheme (d) has a large bias relative to the optimal 
solution. The accumulation scheme (e) diverges far apart from the optimal solution. 

To see why Splash is better, we compare local solutions (c) and (f). They correspond to the 
unweighted SGD and the weighted SGD respectively. We find that solutions (c) have a significant 
bias but relatively small variance. In contrast, solutions (f) have greater variance but much smaller 
bias. It verifies our intuition that reweighting helps to reduce the bias by enlarging the local dataset. 
Note that averaging reduces the variance but doesn’t change the bias. It explains why averaging 
works better with reweighting. 


5 Convergence Analysis 


In this section, we study the SGD convergence when it is parallelized by Splash. The goal of SGD 
is to minimize an empirical risk function 

' ' xes 

where S' is a fixed dataset and w G is the vector to be minimized over. Suppose that there are 
m threads running in parallel. At every iteration, thread i randomly draws (with replacement) a 
subset of samples Si of length n from the dataset S. The thread sequentially processes Si by SGD. 
The per-iteration update is 


t^t + m and ww + (jlwi'w — mrjt'Vi{w]x)) — w), (14) 

where the sample weight is equal to m. We have generalized the update (10) by introducing nw(-) 
as a projector to a feasible set W of the vector space. Projecting to the feasible set is a standard 
post-processing step for an SGD iterate. At the end of the iteration, updates are synchronized by 
equation (9). This is equivalent to computing: 


^new — ^old T UlTl and Wjiew 


^ m 

— (^i^oid + A(mA 

m \ 

i=l 


(15) 


We denote by w* := argminu,gvK ^(tc) the minimizer of the objective function, and denote by w'^ 
the combined vector after the T-th iteration. 


General convex function For general convex functions, we start by introducing three additional 
terms. Let wfj be the value of vector w at iteration k, when thread i is processing the j-th element 
of Si- Let be the stepsize associated with that update. We define a weighted average vector: 


—T 
w = 




j 


V-' V”* 

2 _/fc=l 2--ii=l 2_yj=l 'li,j 


Note that can be computed together with w'^. For general convex L, the function value L{nF) 
converges to L{w*). See Appendix G for the proof. 


Theorem 1. Assume that || V^(t(;; x) ||2 is bounded for all {w,x) G VF x 5. Also assume that r]t is 
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a monotonically decreasing funetion of t such that rg = oo and Vt < Then we have 

lim E[L(IZJ^) - L{w*)] = 0. 

T —>-00 


Smooth and strongly convex function We now turn to study smooth and strongly convex 
objective functions. We make three assumptions on the objective function. Assumption A restricts 
the optimization problem in a bounded convex set. Assumption B and Assumption C require the 
objective function to be sufficiently smooth and strongly convex in that set. 


Assumption A. The feasible set W C is a eompact convex set of finite diameter R. Moreover, 
w* is an interior point ofW; i.e., there is a set Up ■.= {w : \\w — w*\\2 < p} such that Up CW. 

Assumption B. There are finite constants L, G and H such that ||V^L(rt;;x) —i{w* \ x )\\2 < 
L\\w — rc*||2, ||V£(r(;;x)||2 < G and ||V^£(i(;; x) II2 < H for all (w,x) G W x S'. 

Assumption C. The objeetive function L is X-strongly convex over the space W, meaning that 
V‘^L{w) ^ Xldxd for all w G W. 

As a preprocessing step, we construct an Euclidean ball B of diameter D := j pi^ which 

contains the optimal solution w* . The ball center can be found by running the sequential SGD for 
a constant number of steps. During the Splash execution, if the combined vector ^ B, then we 
project it to B, ensuring that the distance between w'^ and w* is bounded by D. Introducing this 
projection step simplifies the theoretical analysis, but it may not be necessary in practice. 

Under these assumptions, we provide an upper bound on the mean-squared error of . The 
following theorem shows that the mean-square error decays as l/(rmn), inversely proportionally to 
the total number of processed samples. It is the optimal rate of convergence among all optimization 
algorithms which relies on noisy gradients [33]. See Appendix D for the proof. 


Theorem 2 . Under Assumptions A-C, if we choose the stepsize 
mean-squared error: 




< 


4G^ Cl 

X^Tmn 



+ 


C2 

rn2’ 


then the output has 


(16) 


where Gi and C 2 are constants independent ofT, m and n. 


When the local sample size n is sufficiently larger than the thread number m (which is typically 
true), the last two terms on the right-hand side of bound (16) are negligibly small. Thus, the 
mean-squared error is dominated by the first term, which scales as l/(Tmn). 


6 Experiments 

In this section, we report the empirical performance of Splash on three machine learning tasks: 
logistic regression, collaborative filtering and topic modeling. Our implementation of Splash runs 
on an Amazon EC2 cluster with eight nodes. Each node is powered by an eight-core Intel Xeon E5- 
2665 with 30GB of memory and was connected to a commodity 1GB network, so that the cluster 
contains 64 cores. For all experiments, we compare Splash with MLlib vl.3 [27] — the official 
distributed machine learning library for Spark. We also compare Splash against single-thread 
stochastic algorithms. 
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(a) Loss function convergence (b) Splash speedup rates 


Figure 2: Multi-class logistic regression on the MNIST 8M digit recognition dataset, (a) The 
convergence of different methods; (b) The speedup over other methods for achieving the same loss 
function value. 


6.1 Logistic Regression 

We solve a digit recognition problem on the MNIST 8M dataset [25] using multi-class logistic 
regression. The dataset contains 8 million hand-written digits. Each digit is represented by a 
feature vector of dimension d = 784. There are ten classes representing the digits 0-9. The goal is 
to minimize the following objective function: 

.. n 9 

i=l k=0 

where Xi G is the feature vector of the i-ih element and yi G {0,..., 9} is its label. The vectors 
wq, ... ,wg G are parameters of the logistic regression model. 

Splash solves the optimization problem by SGD. We use equation (10) to generalize SGD to pro¬ 
cessing weighted samples (the stepsize rjt,m is approximated by mrjt). The stepsize rjt is determined 
by the adaptive subgradient method (AdaGrad) [8]. We compare Splash against the single-thread 
SGD (with AdaGrad) and the MLlib implementation of L-BFGS [31]. Note that MLlib also pro¬ 
vides a mini-batch SGD method, but in practice we found it converging snbstantially slower than 
L-BFGS. 

Figure 2(a) shows the convergence plots of the three methods. Splash converges in a few seconds 
to a good solution. The single-thread AdaGrad and the L-BFGS algorithm converges to the same 
accuracy in much longer time. Figure 2(b) demonstrates Splash’s speedup over other methods. 
When the target loss decreases, the speedup rate over the single-thread SGD grows larger, while 
the speedup rate over MLlib drops lower. Thus, Splash is 15x - 30x faster than MLlib. Note that 
Splash runs a stochastic algorithm and L-BFGS is a batch method. It highlights the advantage of 
the stochastic method in processing large dataset. 

6.2 Collaborative Filtering 

We now turn to a personalized movie recommendation task. For this task, we use the Netflix prize 
dataset [3], which contains 100 million movie ratings made by 480k users on 17k movies. We split 
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the dataset randomly into a training set and a test set, which contains 90% and 10% of the ratings 
respectively. The goal is to predict the ratings in the test set given ratings in the training set. 

The problem can be solved nsing collaborative filtering. Assnme that each user i is associated 
with a latent vector Ui G and each movie J is associated with a latent vector vj G The 
affinity score between the user and the movie is measure by the inner product {ui,Vj). Given ratings 
in the training set, we define the objective function by: 

L{{ui},{vj}) := -rijf + X\\ui\\l +X\\vj\\l) , (17) 

{i,j,rij)€S 

where S represents the training set; The triplet {i,j,rij) represents that the user i gives rating rij 
to the movie j. In the training phase, we ht the user vectors {uj} and the movie vectors {vj} by 
minimizing (17). In the testing phase, we predict the ratings of a user i which might not be in the 
training set. Let {rij}j^j be the observed ratings from the user, we compute the user vector Ui by 

Ui := arg min Y{{u, vj) - rijf + A||n|| 2 , 
jeJ 

then predict the ratings on other movies by {ui,Vj). The prediction loss is measured by the mean- 
squared error. 

To minimize the objective function (17), we employ a SGD method. Let I be the set of users. 
Let Ji represents the set of movies that user i has rated in the training set. We define an objective 
function with respect to {vj} as: 


L{{vj}) := min L{{ui},{vj}) 

{Ui} 



— r. 




+ . 


u 


j&Ji 


;| + A 


E 

jGJi 


(18) 


so that the movie vectors are obtained by minimizing (18). SGD suffices to solve this problem 
because the objective function is a sum of individual losses. 

In practice, we choose the dimension d = 100 and regularization parameter A = 0.02. Thus the 
number of parameters to be learned is 65 million. The movie vectors are initialized by a random 
unit vector on the “hrst quadrant” (all coordinates are positive). Splash runs the generalized SGD 
algorithm (10) with stepsizes determined by AdaGrad. We compare Splash against the single¬ 
thread SGD method and the MLlib implementation of alternating least square (ALS) method. The 
ALS method minimizes the objective function (17) by alternating minimization with respect to 
{ui} and with respect to {vj}. 

According to Figure 3, Splash converges much faster than the single-thread SGD and the ALS. 
This is because that SGD can learn accurate movie vectors by processing a fraction of the the data. 
For example, to achieve a prediction loss lower than 0.70, it takes Splash only 13 seconds, processing 
60% of the training set. To achieve the same prediction loss, it takes the ALS 480 seconds, taking 
40 passes over the full training set. In other words. Splash features a 36x speedup over the MLlib. 


6.3 Topic Modeling 

We use the NYTimes article dataset from the UGI machine learning repository [22]. The dataset 
contains 300k documents and 100 million word tokens. The vocabulary size is 100k. The goal 
is to learn K = 500 topics from these documents. Each topic is represented by a multinomial 
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Figure 3: Collaborative filtering on the Netflix prize dataset. 



runtime (seconds) 


Figure 4: Topic modeling on the NYTimes dataset. The LDA model learns K = 500 topics. 


distribution of words. The number of parameters to be learned is 200 million. 

We employ the LDA model [4] and choose hyper-parameters a = /3 = 0.1. Splash runs the gener¬ 
alized collapsed Gibbs sampling algorithm (12)-(13). We also use the over-sampling technique [43], 
that is, for each word the algorithm independently samples 10 topics, each topic carrying 1/10 of 
the word’s weight. We compare Splash with the single-thread collapsed Gibbs sampling algorithm 
and the MLlib implementation of the variational inference (VI) method [4]. 

To evaluate the algorithm’s performance, we resort to the predictive log-likelihood metric by 
Hoffman et ah [15] . In particular, we partition the dataset into a training set S and a test set T. 
The test set contains 10k documents. For each test document in T, we partition its words into a 
set of observed words 'Wobs and held-out words 'Who) keeping the sets of unique words in Wobs and 
Who disjoint. We learn the topics from the training data 5, and then use that knowledge and the 
word set Wobs to estimate the topic distribution for the test documents. Finally, the predictive 
log-likelihood of the held-out words, namely logp(wnew|wobs) 5), are computed. The performance 
of the algorithm is measured by the average predictive log-likelihood per held-out word. 

Figure 4 plots the predictive log-likelihoods. Among the three methods, the single-thread col¬ 
lapsed Gibbs sampling algorithm exhibits little progress in the first 3,000 seconds. But when the 
algorithm is parallelized by Splash, it converges faster and better than the MLlib implementation 
of variational inference (VI). In particular, Splash converges to a predictive log-likelihoods of -8.12, 
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MNIST 8M (LR) Netflix (CF) NYTimes (LDA) 


Figure 5: Runtime of Splash for taking one pass over the training set. Three machine learning 
tasks: logistic regression (LR), collaborative filtering (CF) and topic modeling (LDA) 


while MLlib converges to -8.36. When measured at fixed target scores, Splash is 3x - 6x faster than 
MLlib. 

6.4 Runtime Analysis 

The runtime of a distributed algorithm can be decomposed into three parts: the computation 
time, the waiting time and the communication time. The waiting time is the latency that the 
fast threads wait for the slowest thread. The communication time is the amount of time spent on 
synchronization. 

We present runtime analysis on the three machine lear.ning tasks. For logistic regression and 
collaborative filtering, we let Splash workers synchronize five times per taking one pass over the 
training set. For topic modeling, we let the workers synchronize once per taking one pass. Figure 5 
breaks down the runtime of Splash into the three parts. For the three tasks, the waiting time is 16%, 
21% and 26% of the computation time. This ratio will increase if the algorithm is parallelized on 
more machines. In contrast, the communication time is 6%, 39% and 103% of the computation time 
— it is proportional to the number of parameters to be learned and logarithmically proportional to 
the number of workers (via TreeReduce). The communication time can be reduced by decreasing 
the synchronization frequency. 

7 Related Work 

Distributed machine learning systems have been implemented for a variety of applications and are 
based on different programming paradigms. Related systems include parameter servers [32, 6, 2, 21], 
Petuum [39], Naiad [29] and GraphLab [26]. There are also machine learning systems built on 
existing platforms, including Mahout [11] based on Hadoop [10] and MLI [36] based on Spark [40]. 

To the best of our knowledge, none of these systems are explicitly designed for parallelizing 
stochastic algorithms. Mahout and MLI, both adopting the iterative MapReduce [7] framework, are 
designed for batch algorithms. The parameter servers, Petuum and Naiad provide user-definable 
update primitives such as (get, set) on variables or (pull, push) on messages, under which a 
distributed stochastic algorithm can be implemented. However, a typical stochastic algorithm 
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updates its parameters in every iteration, which involves expensive inter-node communication. In 
practice, we found that the per-iteration computation usually takes a few microseconds, but pushing 
an update from one Amazon EC2 node to another takes milliseconds. Thus, the communication 
cost dominates the computation cost. If the communication is asynchronous, then the algorithm 
will easily diverge because of the significant latency. 

GraphLab asynchronously schedules communication using a graph abstraction, which guaran¬ 
tees the serializability of the algorithm. Many stochastic algorithms can be written as a graph-based 
program. For the SGD algorithm, one constructs a vertex for every sample and every feature, and 
connects an edge between the vertices if the sample processes a particular feature. However, when 
the individual feature is shared among many samples, running SGD on this graph will cause many 
conflicts, which significantly restricts the degree of parallelism. Such a paradigm is efficient only if 
the feature is very sparse. 

Splash doesn’t guarantee a bounded delay model like Petuum, nor does it guarantee the serial¬ 
izability like GraphLab. Instead, it provides a programming interface with limited operations (add, 
multiply and delayed add) and asks the algorithm to process weighted data. As a consequence, 
Splash exhibits descent performance in parallelizing stochastic algorithms. 

Apart from distributed systems literature, there is a flurry of research studying communication- 
efficient methods for convex optimization. Some of them applies to stochastic algorithms. Zinkevich 
et al. [45] study the one-shot averaging scheme for parallelizing SGD. Jaggi et al. [16] present a 
framework for parallelizing stochastic dual coordinate methods. Both methods can be implemented 
on top of Splash. Our theoretical analysis on SGD generalizes that of Zinkevich et al. [45]. In 
particular, the results of Section 5 assume that the parallelized SGD is synchronized for multiple 
rounds, while Zinkevich et al. [45] let the algorithm synchronize only at the end. The multi¬ 
round synchronization scheme is more robust when the objective function is not strongly convex. 
But its theoretical analysis is challenging, because the updates on separate threads are no longer 
independent. 

8 Conclusion 

In this paper, we have described Splash, a general framework for parallelizing stochastic algorithms. 
The programming paradigm of Splash is designed around a key concept: implementing incremental 
updates that processes weighted data. This paradigm allows the system to automatically parallelize 
the algorithm on commodity clusters. On machine learning tasks, Splash is order-of-magnitude 
faster than state-of-the-art implementations adopting the iterative MapReduce. The fast perfor¬ 
mance is partially due to the superiority of stochastic algorithms over the batch algorithm, and 
partially due to the communication-efficient feature of the system. In addition. Splash is built on 
top of Spark which allows it seamlessly integrating with the existing data analytics stack. 


Appendix 

A Constructing Linear Transformation on a Thread 

When element-wise operators are sequentially applied, they merge into a single linear transforma¬ 
tion. Assume that after processing a local subset S, the resulting transformation can be represented 
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by 

v^T{S)-v + A{S) + T{S) 

where r(5) is the scale factor, A(5) is the term resulting from the element-wise add operators, and 
T{S) is the term resulting from the element-wise delayed add operators declared before the last 
synchronization. 

We construct r(5), A(5) and T{S) incrementally. Let P be the set of processed elements. At 
the beginning, the set of processed elements is empty, so that we initialize them by 

r(P) = 1, A(P) = 0 and ^{P) = 0 for P = 0. 

After processing element we assume that the user has performed all types of operations, resulting 
in a transformation taking the form 

v-<^j{v + t) + S (19) 

where the scalars 7 and 6 result from instant operators and t results from the delayed operator. 
Concatenating transformation (19) with the transformation constructed on set P, we have 

u ^ 7 • (r(p) • V + A{P) + r(p) + 1 ) + <5 

= 7 • r(P) • u + (7 • A(P) + 5 ) + (7 • T{P) + 7 t) . 

Accordingly, we update the terms L, A and T by 

r(PU{ 2 }) = 7 -r(P), A(PU{z}) = 7 -A(P) + (5 and r(PU{z}) = 7 -T(P)+ 7 t (20) 

and update the set of processed elements by P PU{z}. After processing the entire local subset, 
the set P will be equal to S, so that we obtain r(S'), A{S) and T{S). 

B Determining Thread Number 

Suppose that there are M available cores in the cluster. The execution engine partitions these cores 
into several groups. Suppose that the Lth group contains rui cores. The group sizes are determined 
by the following allocation scheme: 

• Let 4mo be the thread number adopted by the last iteration. Let 4mo := 1 at the hrst 
iteration. 

• For i = 1,2,..., if Srrii-i < M — ^be let rrii := 4mi_i. Otherwise, let m* := 

M — Terminate when mj = M. 

It can be easily verihed that the candidate thread numbers (which are the group sizes) in the 
current iteration are at least as large as that of the last iteration. The candidate thread numbers 
are 4mo, 16mo, •.. until they consume all of the available cores. 

The Lth group is randomly allocated with Parametrized RDD partitions for training, and 
allocated with another rrii Parametrized RDD partitions for testing. In the training phase, they 
execute the algorithm on rrii parallel threads, following the parallelization strategy described in 
Section 4.2. In the testing phase, the training results are broadcast to all the partitions. The 
thread number associated with the smallest testing loss will be chosen. The user is asked to 
provide an evaluation function i -.W x 5" —>• M which maps a variable-sample pair to a loss value. 
This function, for example, can be chosen as the element-wise loss for optimization problems, or the 
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negative log-likelihood of probabilistic models. If the user doesn’t specify an evaluation function, 
then the largest rrii will be chosen by the system. 

Once a thread number is chosen, its training result will be applied to all Parametrized RDD 
partitions. The allocation scheme ensures that the largest thread number is at least 3/4 of M. 
Thus, in case that M is the best degree of parallelism, the computation power will not be badly 
wasted. The allocation scheme also ensures that M will be the only candidate of parallelism if the 
last iteration’s thread number is greater than M/2. Thus, the degree of parallelism will quickly 
converge to M if it outperforms other degrees. Finally, the thread number is not updated in every 
iteration. If the same thread number has been chosen by multiple consecutive tests, then the system 
will continue using it for a long time, until some retesting criterion is satished. 


C Proof of Theorem 1 


We assume that ||V£(u); x )||2 < G for any {w,x) G IF x £'. The theorem will be established if the 
following inequality holds: 

T m n T m n 

- L{w*)] < mE[\\wo - w*h - \\w'^ - w*h] + G'^ 

k=l i=l j=l k=l i=l j=l 

To see how inequality (21) proves the theorem, notice that the convexity of function L yields 

E[L(wj) - L{w*)] < 

Thus, inequality (21) implies 


Efc=i E™ 1 Ej=i vljHHwij) - L{w*)] 


2-jk=l Z-jj=l 'I'. 




k \2 


E[L{wj) - L{w*)] < 


mE[||u;o -u;*||2 - \\w-^ - Efc=i EI^i Ej=i(^*^j) 


2-^k=l 2-ji=l 2-jj=l lij 

By the assumptions on rjt, it is easy to see that the numerator of right-hand side is bounded, but 
the denominator is unbounded. Thus, the fraction converges to zero as T —)• oo. 

It remains to prove inequality (21). We prove it by induction. The inequality trivially holds for 
T = 0. For any integer A: > 0, we assume that the inequality holds for T = k — 1. At iteration k, 
every thread starts from the shared vector w^~^, so that w^i = w^~^. For any j G {1,... ,n}, let 
be a shorthand for ,;x). A bit of algebraric transformation yields: 


= - w 


k ^k ^k 

* l|2 I fry^k \2 


I<i +1 - ^11^ = mwiwfj - vyij) -w*\\i< Wwfj - -W*r 2 


^*I|2 
-k l|2 




where the inequality holds since w* and Yiw is the projection onto W. Taking expectation on 
both sides of the inequality and using the assumption that ||s'fj ||2 < G, we have 

E[|K-+i - < n\Hj - w*\\l] + G\4jf - 2gl^E[{wl^ - w*,VL{wl^))]. 

By the convexity of function L, we have {w^j — w*,VL{w^j)) > L{w^j) — L{w*). Plugging in this 
inequality, we have 

2gl^E[L{wl^) - L{w*)] < E[||u;,^^. - w*\\l] - E[||u;,^^.+i - w*\\l] + (22) 
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Summing up inequality (22) for i = 1,...,m and j = 1,... ,n, we obtain 

m n m 

- L{w*)] < mE[\\w^-^ - w*\\l] - ^ - w*\\l] 

i=l j=l 


2 = 1 


i=l j=l 


k \2 
i,j> ■ 


( 23 ) 


Notice that Thus, Jensen’s inequality implies YllLi Il'U^fn+i “ ^*111 — ~ ^*I| 2 - 

Plugging this inequality to upper bound (23) yields 

m n m n 

Y,Y.‘^vtjnL{wlj) - L{w*)] < mE[||u;"-i - w*\\l - \\w’^ - w*\\l] + ■ (24) 

2=1 j = l 2 = 1 j = l 

The induction is complete by combining upper bound (24) with the inductive hypothesis. 


D Proof of Theorem 2 

Recall that is the value of vector w after iteration k. Let be the output of thread i at the 
end of iteration k. According to the update formula, we have = IIb{^ YllLi ^i), where nB(-) 
is the projector to the set B. The set B contains the optimal solution w*. Since projecting to a 
convex set doesn’t increase the point’s distance to the elements in the set, and because that wf 
{i = 1,..., m) are mutually independent conditioning on , we have 


E[||u;^ -u;*||^] <E 


E 


- lit 


m 


i=l 


W 


k-l 


^ ^E[E[||u;f-u;*||i|u;^-i]] + —V 


E[E[{wf-w*,wY'w*)\w’^-^]] 


m 


2 ■ 
2=1 


m 


2 


m — 1 


= -E[|K-u;*||^] + 
m m 


i¥=j 


E[||EK|u;'=-V«^*lli] 


(25) 


Equation (25) implies that we could upper bound the two terms on the right-hand side respectively. 
To this end, we introduce three shorthand notations: 

Ofc := E[||u;*’ - 

bk :=E[||u;^-u;*||2], 

Cfc :=E[||E[u;f 

Essentially, equation (25) implies Ofc < ^bh + Let oq := — rc *||2 where is the initial 

vector. The following two lemmas upper bounds 6^+1 and Ck+i as functions of a^. We defer their 
proofs to the end of this section. 


Lemma 1. For any integer k > 0, we have 


bk+i < 




{k + iy 


;(kk + 


/3i 


{k -|- l) 2 re 


where /3i := 4.G^/X^ 
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Lemma 2. We have ci < and for any integer k'>l, 

P , 2^2v^ + /3i/n frnt7/MD 

Cfe+i < + {k + ifn — /32 := max I [ 2H/X\ R, 


8G2(L + G/p2) 

A3 


Combining equation (25) with the results of Lemma (1) and Lemma (2), we obtain an upper 
bound on oi: 


O-l <-1-O'— P3- 

mn 

Furthermore, Lemma (1) and Lemma (2) upper bound flfc+i as a function of o^: 


Ofc+l < 




O-k + 


{k + 1)2 

Using upper bounds (26) and (27), we claim that 

/^s + 2/32 v^/n 


Q'fc ^ 




/33 + 2/32 v^/n 
(A: + l)2 


for fc = 1, 2,, 


(26) 


(27) 


(28) 


By inequality (26), the claim is true for k = 1. We assume that the claim holds for k and prove it 
for k + 1. Using the inductive hypothesis, we have < (ds. Thus, inequality (27) implies 


afc+i < 




Ps + 2/32v^/n /33 + 2fd2\f^/n 


(A;+ 1)2 k (A: +1)2 (A: + l)n 

which completes the induction. Note that both /3i and /32 are constants that are independent of k, 
m and n. Plugging the dehnition of /33, we can rewrite inequality (28) as 


Q'fc ^ 


4G2 


+ 


Gi 


+ 


G2 


X^kmn kmfl’^nfl’^ kn^ 
where Gi and C 2 are constants that are independent of k, m and n. This completes the proof of 
the theorem. 


D.l Proof of Lemma 1 

In this proof, we use Wj as a shorthand to denote the value of vector w at iteration A: + 1 when 
the first thread is processing the j-th element. We drop the notation’s dependence on the iteration 
number and on the thread index since they are explicit from the context. Let gj = '\/i{wj‘,Xj) 
be the gradient of loss function i with respect to wj on the j-th element. Let rjj be the stepsize 
parameter when wj is updated. It is easy to verify that T]j = 

We start by upper bounding the expectation of _ -y;* ||2 conditioning on w^. By the strong 

convexity of L and the fact that w* minimizes L, we have 

{K[gj],Wj — w*) > L{wj) — L{w*) + — \\wj — w *\\2 ■ 

as well as 

L{wj) - L{w*) > ^ \\wj - w*\\l. 

Hence, we have 

{M[gj],Wj — w*) > A \\wj — w *\\2 (29) 

Recall that nw(-) denotes the projection onto set W. By the convexity of W, we have ||nw(it) — v \\2 < 


20 















||tt — f ||2 for any u,v £ W. Using these ineqnalities, we have the following: 

]E[||wj+i - w*\\l\w^] = E[\\Uwiwj - r]jgj) - w*\\2\w’‘] 

< E[||u;j - r]jgj - w*\\l\w'^] 

= E[\\wj-w*\\2\w’^]-2gjE {gj,Wj - w*)\w’^ + g‘jE[\\gj\\l\w’^] 
Note that the gradient gj is independent of Wj conditioning on w^~^. Thus, we have 


E 


{gj,Wj — w*)\w'‘ = E[(E[ 5 j],rcj — w*)\w'‘] > AE[||r(;j — w *\\2 


where the last inequality follows from inequality (29). As a consequence, we have 
E[||r(;j+i — rc*||2l'«^^] < (1 — 27/jA)E[||uij — rr’*||2|w^^] + VjG^- 
Plugging in gj = ^(kl+j) > obtain 

E|lk,+l - < (l - E||k, - w-|lik‘l + 


Case A: = 0: We claim that any j > 1, 


E[||n;j - u;*||2] < 


4G2 


(30) 


(31) 


Since wl = Wn+i, the claim establishes the lemma. We prove the claim by induction. The claim 
holds for j = 1 because inequality (29) yields 


_ yj*f < (]E[gi],wi " w*) ^ G||u;i - 'w;*||2 


A - A 

Otherwise, we assume that the claim holds for j. Then inequality (30) yields 

4 A 4G2 4G2 


Wl — tc*||2 < G/X. 


E[||n;,-+i-n;*||i]< (l-+ 




4G2 j - 4 + 1 

A2 j2 


< 


4G2 


A2(j + 1)’ 


which completes the induction. 


Case k > 0: We claim that for any j > 1, 

1 




\2\\„..k „..*\\2 , 4G^(j 1) 


{knY\\w'^ - w*\\2 + 


(32) 


{kn + j — 1)^ V 

We prove (32) by induction. The claim is obviously true for j = 1. Otherwise, we assume that 
the claim holds for j and prove it for j' + 1. Since 1 — ^ combining the inductive 

hypothesis and inequality (30), we have 

E[||t(;j+i - w*\\l\vY] 

1 


< 


/; \2|| k *\\2 I 1) 

[knj \\w — w II 2 H- ^^2 - 


{kn + JY 

1 fn \‘^ii k *\\2 I 

{kn + j)^ y ' " A 2 J 


4G2 

\^{kn + j)^ 
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which completes the induction. Note that claim (32) establishes the lemma since Wi = Wn+i- 


D.2 Proof of Lemma 2 

In this proof, we use wj as a shorthand to denote the value of vector w at iteration k + 1 when 
the first thread is processing the j-th element. We drop the notation’s dependence on the iteration 
number and on the thread index since they are explicit from the context. Let gj = Vi{wj‘,Xj) 
be the gradient of loss function i with respect to wj on the j-th element. Let gj be the stepsize 
parameter when wj is updated. It is easy to verify that gj = ^(kn+j) ' 

Recall the neighborhood Up dW in Assumption A, and note that 

Wj+i -w* = Uwiwj - gjgj - w*) 

= Wj - gjgj -w* + 0 Up) {I\-w{wj - gjgj) - {wj - gjgj)) 

since when w € Up, we have Iiw{w) = w. Consequently, an application of the triangle inequality 
and Jensen’s inequality gives 

||]E[iCj+i - w*\w ’']\\2 < IlIEftCj - gjgj - w*\w ’']\\2 

+ E \\{Uwiwj - gjgj) - (wj - gjgj))l{wj+i ^ I7p)||2 \w^ 

By the definition of the projection and the fact that Wj € W, we additionally have 

\\Uw{wj - gjgj) - (wj - gjgj)\\2 < \\wj - {wj - ?7j5'j))ll2 < Vj Hh ■ 

Thus, by combining the above two inequalities, and applying Assumption B, we have 
||E[u;j+i -u;*|u;^]||2 < \\E[wj-gjgj - w*\w^]\\2 + gjE[\\gj\\^l^^.^^^Up)\w^] 

< ||E[u;j - gjgj - w*\w’']\\2 + gjG • P{wj 0 Up\w^) 


< \\E[wj - gjgj - u;*|u ;^]||2 + gjG 




(33) 


where the last inequality follows from the Markov’s inequality. 

Now we turn to controlling the rate at which Wj — gj — gj goes to zero. Let ^j{■) = Xj) be a 
shorthand for the loss evaluated on the j-th data element. By defining 

Tj := gj - Vij{w*) - V^lj{w*){wj - w*), 

a bit of algebra yields 

gj = Vij{w*) + V'^ij{w*){wj — w*) + rj. 

First, we note that E[V£j{w*)\w’^] = VL{w*) = 0. Second, the Hessian V^£j(w*) is independent of 
Wj. Hence we have 

E[gj\w^] = E{V£j{w*)] + E[V^ij{w*)\w'^] • Efu;^ - w*\w'^]+E[rj\w'^] 

= V'^L{w*)E[wj - w*\w’^] + E[rj\w^]. 

Taylor’s theorem implies that Xj is the Lagrange remainder 

rj = {V^£j{w') - V^£j{w*)){w' - w*), 

where w' = awj -|- (1 — a)w* for some a G [0,1]. Applying Assumption B, we find that 


(34) 


E[||rj|| 2 |u;*’] < E[\\V^£j{w') - V^£j{w* 
< LE[\\wj - w*\\l\w% 


I 2 \\wj -'u;*|| 2 |u;'=] 


(35) 
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By combining the expansion (34) with the bound (35), we find that 

\\E[wj - ijjQj - = E[(/ - r]jV^Fo{w*)){wj - w*) + r]jrj\w^] 

< 11(7 - r]jV^L{w*))E[wj - u;*|u ;^]||2 + VjLE[\\wj - w*\\l\w'^]. 

Using the earlier bound (33) and plugging in the assignment rjj = this inequality then 

yields 


IIEK -+1 - r/;*|u ;'=]||2 < ||7 - r]jV‘^Liw*)\\2 


+ -“’■1®“'“! + 
Next, we split the proof into two cases when 7 = 1 and 7 > 1. 


GE[||tCj+i — r(;*||2|w 


(36) 


Case 7 = 0: Note that by strong convexity and our condition that ||V^L(t (;*)||2 < 77, whenever 
r]jH < 1 we have 

||7 - 7JjV^L(w*)ll2 = 1 - VjXmin{^^L{w*)) < 1 " Vj>^ 


Define tq = [277/A]; then for j > tq, we have rjjH <1. As a consequence, inequality (31) (in the 
proof of Lemma 1) and inequality (36) yield that for any j > tq, 

0^2 

IIEK+l - »*lll2 < (1 - 2/j) ||EK- - w-]||, + {i + G/p") . (37) 

As shorthand notations, we define two intermediate variables 

q/^2 

ut = ||E(r(;j - and bi = (L + G/p^) . 

Inequality (37) then implies the inductive relation 

Uj+i < (1 - 2/j)uj + bi/f for any j > tq. 


Now we claim that by defining 62 := max{ro7?, 61 }, we have Uj < fi/j. Indeed, it is clear that 
Uj < tqR/ j for j = 1, 2,..., Tq. For t > tq, using the inductive hypothesis, we have 


(1 - 2/j)b2 , bi 

Uj + l < -^-® ^ < 

J r 


62J — 262 + ^2 


- 1) < h 
f “ J + 1 ’ 


This completes the induction and establishes the lemma for 7 = 0. 


Case 7 > 0: Let Uj = ||E[tCj — u;*|u ;^]||2 and 6 = \\w^ — rc *||2 as shorthands. Combining inequal¬ 
ity (32) (in the proof of Lemma 1) and inequality (36) yield 


Uj+i S 


< 1 - 


< 1 - 


kn -|- j 
2 

7n -|- j 


Uj -|- 


Uj + 


2(7/ -I- G/p^) 


\{kn + j){kn + j — 1)^ 
2(L + G/p2) 


{knys^ + 


\{kn 

{kn + j — 2)(7n + j — 1) 
{kn + j — l){kn + j) 


Uj -|- 


j){kn + j — l)7n 
biknd^ + 62/7 


{knyS^ + 


AG^j\ 
A2 J 

AG^n 


A2 


[kn + j — l)(7n -|- j) 


(38) 
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where we have introduced shorthand notations bi := ) ^nd 62 := ^ • With these 


notations, we claim that 


Uj < 


{kn — l)kn6 + (j — l){biknd‘^ + b 2 /k) 


(39) 


{kn + j — 2){kn + j — 1) 

We prove the claim by induction. Indeed, since ui = 6, the claim obviously holds for j = 1. 
Otherwise, we assume that the claim holds for j, then inequality (38) yields 

{kn — l)knd + {j — l){biknd‘^ + 62 /fc) bikn6‘^ + 62 /fc 


Uj+l < 


{kn + j — l){kn + j) 
{kn — l)knd + j{bikn5‘^ + 62 /A:) 


+ 


{kn + j — l)(A:n + j) 


(A:n + j — l)(A:n + j) ’ 

which completes the induction. As a consequence, a bit of algebraic transformation yields 

{kn — l)knd + n{bikn5'^ + 62 /A:) 


|| EK +^- u ;*| u ;'=]||2 = «„+!< 

k'^n^S 


< 


< 


{{k + l)n — 1)(A: + l)n 
77,62/A: 


nb-iknS"^ 

+ , ' + 


{k + kn{k + 1)77 kn{k + 1)77 


k 


k + 1 

k I 


5 + t^ + 


62 


A: + 1 k{k + 1)77 


k6 + ^ 


k + 1 


616^ 62 

A: + 1 k'^n 


(40) 


By the fact that & B, we have ^^616 < ^^ 6 iD < 1. Thus, inequality (40) implies 


Taking expectation on both sides of the inequality, then applying Jensen’s inequality, we obtain 
E[||E[77;^+^ 

Hence, the lemma is established. 


|2J < 


eE[6‘^] ^ 262 V^Ei^ + bl/n 


(A:+ 1)2 


{k + 1)^77 
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